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Abstract 

Because the presence of views enhances query performance, materialized views are in- 
creasingly being supported by commercial database/data warehouse systems. Whenever 
the data warehouse is updated, the materialized views must also be updated. However, 
whereas the amount of data entering a warehouse, the query loads, and the need to obtain 
up-to-date responses are all increasing, the time window available for making the warehouse 
up-to-date is shrinking. These trends necessitate efficient techniques for the maintenance of 
materialized views. 

In this paper, we show how to find an efficient plan for maintenance of a set of views, by 
exploiting common subexpressions between different view maintenance expressions. These 
common subexpressions may be materialized temporarily during view maintenance. Our 
algorithms also choose subexpressions/indices to be materialized permanently (and main- 
tained along with other materialized views), to speed up view maintenance. While there 
has been much work on view maintenance in the past, our novel contributions lie in ex- 
ploiting a recently developed framework for multiquery optimization to efficiently find good 
view maintenance plans as above. In addition to faster view maintenance, our algorithms 
can also be used to efficiently select materialized views to speed up workloads containing 
queries. 

1 Introduction 

Materialization of views can help speed up query and update processing. Views are especially 

attractive in data warehousing environments because of the query intensive nature of data 

warehouses. However, when a warehouse is updated, the materialized views must also be 

updated. Typically, updates are accumulated and then applied to a data warehouse. Loading 
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of updates and view maintenance in warehouses has traditionally been done at night. While 
the need to provide up-to-date responses to an increasing query load is growing and the amount 
of data that gets added to data warehouses has been increasing, the time window available for 
making the warehouse up-to-date has been shrinking. These trends call for efficient techniques 
for maintaining the materialized views as and when the warehouse is updated. 

Given multiple views, the view maintenance problem can be seen as computing the expres- 
sions corresponding to the "delta" of the views, given the "delta" s of the base relations that are 
used to define the views. The contributions of this paper lie in the exploitation of the Multi- 
Query Optimization (MQO) framework along with our recently developed efficient algorithms 
for MQO, to compute the delta expressions corresponding to the multiple views defined in a 
data warehouse. 

It is not difficult to motivate that query optimization techniques are important for choosing 
an efficient plan for maintaining a view. For example, consider the expression (A M B) N C, 
where A, B and C are multisets (i.e., relations with duplicates). Given that the multiset of 
tuples 5~c is inserted into C, the change to the view is given by (A N B) N 5q. This expression 
can equivalently be computed as (A N 6^,) N B and by (B N S^,) N A, one of which may be 
substantially cheaper to compute. Further, in some cases the view may be best maintained 
by recomputing it, rather than by finding the differentials as above. Vista | Vis9£jl describes 



how to extend the Volcano query optimizer |GM91] to choose the best plan for computing the 
differential of the result of an expression. 



Given a set of queries, multiquery optimization [Sel88] provides the possibility of reducing 
costs by computing shared subexpressions once, materializing them temporarily, and reusing 
them where required in the given set of queries. Although multiquery optimization was earlier 



viewed as expensive, our recent work [ RSSBOOfl has provided efficient algorithms for multiquery 



optimization, making it practical. In this paper, we provide practical solutions to the problem 
of optimizing the update of a set of materialized views, by exploiting these algorithms. 
Specifically, our contributions are as follows. 

1. We extend the multiquery optimization algorithms to find the best plan for computing 
the differential of a set of expressions, by exploiting shared subexpressions. 

Sharing of subexpressions occurs when multiple views are being maintained, since related 
views may share subexpressions, and as a result the maintenance expressions may also 
be shared. Furthermore, sharing can occur even within the plan for maintaining a single 
view, as we illustrate later in the paper. 

Our algorithms choose shared expressions to be temporarily materialized during view 
maintenance, and choose view maintenance plans that utilize these temporarily material- 
ized results. 

2. Just as the presence of views allows queries to be evaluated more efficiently, the mainte- 
nance of these views can be made more efficient by the presence of additional views /indices 
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[R.SS96]. That is, given a set of materialized views to be maintained, we need to choose 
what additional indices and views should be materialized to minimize overall view main- 
tenance costs. 

The choice of additional views must be done in conjunction with choosing plans for main- 
taining the views. For instance, a plan that seems quite inefficient could become the best 
plan if some intermediate result of the plan is chosen to be materialized and maintained. 

Our contribution here is to show how to extend the multiquery optimization algorithms of 
RSSBOCfl to tackle the problems of selecting permanent materialized views, in conjunction 



with choosing the best plans for updating the views. 

We show how to cleanly integrate the choice of expressions/indices to be permanently materi- 
alized, with the choice of expressions/indices to be temporarily materialized. 

It is worth pointing out that although our focus in this paper is to speed up view main- 
tenance, our algorithms can also be used to choose extra temporary and permanent views in 
order to speed up a workload containing queries and updates (that trigger view maintenance) . 

There has been much earlier work on choosing a set of views to be materialized and main- 
tained to optimize given workloads of queries and updates. The major differences between our 
work and earlier work can be summarized as follows (we outline the differences in detail in 
Section §): 

1. Given a set of related materialized views, temporarily materializing common subexpres- 
sions could have significant benefit. However, earlier work did not consider how to exploit 
common subexpressions by temporarily materializing them because of their focus on per- 
manent materialization and common subexpressions involving differential relations cannot 
be permanently materialized. 

2. The earlier work does not cover efficient techniques for the implementation of materialized 
view selection algorithms, in particular, their integration with query optimizers. In the 
context of materialized view maintenance, this is an important problem since the cost of 
view maintenance can be reduced by the presence of (additional) indices on relations, and 
of appropriate extra materialized views. 

In contrast, we show how to efficiently choose views/indices to be (permanently) materi- 
alized by extending the multiquery optimization algorithms of [ RSSBOCfl . 



The rest of the paper is organized as follows. We outline related work in Section ^ and 
provide the reader with some background in view maintenance in Section ||. We describe the 
DAG structure used to represent queries in Section [|, and algorithms to find optimal update 
plans (without materializing additional views) in Section |5| Section || presents an optimized 
greedy algorithm for selecting extra views for materialization. Section fj] outlines results of a 
performance study, and Section || concludes the paper. 
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2 Related Work 

There has been a large volume of research on incremental view maintenance in the past decade. 
Amongst the early work on computing the differential results of operations/expressions was 
Blakeley et al. jBCL86|| . More recent work in this area includes jGL95| , |CGL + 96| , |MQM97 |. 
Gupta and Mumick [GM95| provide a survey of view maintenance techniques. 

Blakeley et al. [BCL86] and Ross et al. [ RSS96 ] noted that the computation of the expres- 
sion differentials has the potential for benefiting from multiquery optimization. In the past, 
multiquery optimization was viewed as too expensive for practical use. As a result they did 
not go beyond stating that multiquery optimization could be useful for view maintenance. Our 
recent work in RSSBOOfl provides efficient heuristic algorithms for multiquery optimization, and 
demonstrates that multiquery optimization is feasible and effective. 

There has been much work on selection of views to be materialized. One notable early 



work in this area was by Roussopolous Rou82 |. Ross et al. [RSS96] considered the selection of 
extra materialized views to optimize maintenance of other materialized views/assertions, and 
mention some heuristics. Labio et al. [LQA9?]] provide further heuristics. The problem of 
materialized view selection for data cubes has seen much work, such as |HRU9£], who propose 
a greedy heuristic for the problem. Gupta [ Gup97 ] and Gupta and Mumick ]GM9£ ] extend 
some of these ideas to a wider class of queries. However, (a) none of the above papers consider 
implementation details that are important for efficient selection of views, and (b) none of these 
consider how to optimize view maintenance expressions. 

Vista [ Vis98| ] describes how to extend the Volcano query optimizer to optimize view main- 
tenance. However, she does not consider the materialization of expressions, whether temporary 
or permanent. Optimizations that exploit knowledge of foreign key dependencies can be used 
to detect that certain join results involving differentials will be empty [ |QGMW96 , |VisS 



Roy et al. [ |RSSB00 | consider how to perform multiquery optimization by selecting subex- 
pressions/indices for temporary materialization. They present important optimizations of a 
greedy heuristic for materialized view selection that makes the heuristic practical. However, 
they do not consider updates or view maintenance, which is the focus of this paper. We utilize 
the optimizations proposed in RSSBOC ], but the extensions required to to take update costs 
into account, and optimize view maintenance expressions, are non-trivial. 

There has been earlier work on multiquery optimization, including | Scl88| , 5SN94 , SG9Cfl 
and more recently [ |5V98 |, but none of these consider updates. 

3 Background and Motivation 

We assume that updates (inserts/deletes) to relations are logged in corresponding delta relations, 
which are made available to the view refresh mechanism. We assume for each relation r, there 
are two relations and 5~ denoting, respectively, the (multiset of) tuples inserted into and 
deleted from the relation r. 

The view refresh mechanism is invoked as part of an update transaction for immediate 
update, or periodically for deferred updates. In the second case, updates performed by many 
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transactions may be collected together in the delta relations <5+ and 5 r for a relation r. Our 
techniques work regardless of whether the updates are immediate or deferred. 

3.1 Computing the Differential of an Operation 

There is a considerable amount of literature on computing differentials of operations, as outlined 
in Section ^. For completeness, we briefly review techniques for computing the differential of 
an operation in the multiset relational algebra. 

3.1.1 Differentials of Joins 

Consider a multiset join AMB, and suppose A and B are updated by inserting the multisets 
of tuples 5\ and 5g respectively. Let A old and B old refer to the old values of A and B, that 
is their contents before the update. The multiset of tuples that get added to the view V are 
denoted by Sy, and can be computed as: 

6+ = (5+ N B old ) U (A old N 5+) U (S% H 6+) 

View V is then updated as follows: V <— V U 8y 

Similarly, if tuples 5^ and 5g are deleted from A and B respectively, the multiset of tuples 

Sy = (52 X B old ) U (A old M<5 B )U(^M,J B ) 

get deleted from V, which is then updated by: V <— V — Sy. 

Updates can be modeled as deletes followed by inserts. If both inserts and deletes are 
present on a relation, we get a more complex expression for updating the relation. 

V ^VU {A old M 8+) u (&X N B old ) U (&X ^ $b) ~ ( A ° ld M s b) ~ ( S A M B ° ld ) u ( 5 A N 5 b) 

In contrast if only one input, say A, is updated by only insertion the change in the view is 
much easier to compute: 

5+ = g% N B old 

and similarly for deletions on A, 

5 V = 5 A M B old 

To keep expressions simple (and for another reason which we describe later in Section [3.2.3 ) 
we assume that updates are propagated one relation at a time, and only one type of update at 
a time. This simply means we compute the effect of all inserts on A, then update A with 5\, 
then compute the effects of all deletes on A, update A with <5T. We then proceed with inserts 
to B, and then with deletes from B. 

The net result is the same as if the more complicated expressions are used, but the expres- 
sions we need to deal with are much simpler. Note that an operation may have two complex 
expressions as inputs, and if both use a particular relation, even with this restriction there may 
be differential results on both its inputs, in which case the more complex expression allowing 
differentials on both inputs must be used. 
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3.1.2 Differentials of Other Operations 

If the result of a groupby /aggregate operation, such as aG count{B){E) , has been materialized 
(and the aggregate function is distributive), the change in the aggregate result can be computed 
using only the changes (5g and 5g) to the input E, and the old result of the aggregation^] The 
group-by /aggregation operation is executed on the tuples from the delta relations, and the 
results are merged into the existing materialized view using a merge operation. For more 



details, see, e.g., | GM95 |, and for extensions to operations such as median, see [RSSS9J]. 

To compute the differential result of an aggregate/grouping operation whose result has not 
been materialized, we would have to recompute the aggregate values for all groups which are 
affected by the update. This may involve significant extra work.0 

Standard techniques are available for computing the differentials of other operations, such 



as duplicate elimination (and projection), and outer joins [|GL95| , GJM97]. We omit details but 



note that we can use these techniques without any changes to our optimization algorithms. 
3.2 Computing the Differential of an Expression 

Views are defined by potentially complex expressions, hence we need to find the differential of 
an entire expressions. 

3.2.1 Generating a Differential Expression 

Techniques, such as that of | GL95fl can be used to generate an expression that computes the 



differential of a given expression. However, the resultant expression can be very large - expo- 
nential in the size of the query. For instance consider the view V = A N B M C, with inserts 
on all three relations. The differential in the result of V can be computed as 

(i+MBMC) U (A N (5+ M C) U (A M B M 5+) U (A N <5+ XI 5+) U 
(5\ M B N £+) U {5\ N 8% M C) U {5\ N 5+ M 6+) 

The size of this expression is exponential in the number of relations. Optimizing such large 
expressions can be quite expensive, since query optimization is exponential in the size of the 
expression. There are many common subexpressions in the above expression, and the above 
expression could be simplified by factoring, to get: 

(5+NSMC) U ((A U <5^) N 6+ M C) U {{A U 5~a) M(SU Sg) N Sq) 
But creating simplified forms of differential expressions is difficult with more complex expres- 
sions containing operations other than join. 

Therefore our algorithms use an alternative technique, which we outline in the next section. 

For some operations like average, the count of tuples in each group must also be materialized. Even for the 
sum operation, the count of tuples is needed to deal with deletions. 



2 There are techniques, such as [CGL + 96, MQM97], that use differentials of more complex forms, such as 



changes in the value of an aggregate result, to avoid recomputing aggregate values in some special cases. Our 
techniques can be extended to deal with such differentials, but for simplicity we do not consider such techniques 
here. 
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3.2.2 Propagating Updates Up An Expression 

An alternative to generating a differential expression is to propagate differentials up an expres- 



sion [Rou82, RSS9C]. Propagation is best understood by visualizing an expression as a tree. 
The differential of a node in the tree is computed using the differential (and if necessary, the 
old value) of its inputs, as described earlier. We start at the leaves of the tree, and proceed 
upwards, computing the differential expressions at each node. 

For example, consider an expression (A N B) M C), and suppose we wish to propagate 
inserts to A We can do so by first computing the differential of the node A N B as 8\ N B. 
We then join this differential with C, which is the other input of its parent node, to get the 
differential of the parent. 

As mentioned earlier, if there are updates to multiple relations, we propagate one type 
of update to one relation at a time. Doing so simplifies the expressions for computing the 



differentials, as outlined in Section 3.1, and permits a different evaluation plan to be chosen 



for each expression; this is essential for efficient view maintenance, as we will see next, in 



Section 3.2.2. 



The process of computing the differential of an expression can be expressed purely in terms 
of how to compute the differentials for each operation in the expression. There is no need to 
rewrite the entire expression. Note also that the procedure for computing differentials of an 
expression can be easily extended to handle expressions using new types of operations, so long 
as we have a way of incrementally computing the differential of the operation. 

3.2.3 The Role of Query Optimization 

Consider an expression A N (B M C), and suppose tuples are inserted into A. We can compute 
the differential of the result as 5~t N (B M C). If we compute this expression as shown, we 
would need to compute B N C, which does not involve any 8 relation, and hence may be large 
and expensive. A better way of evaluating the differential may be (<5^ N B) M C. Note that 
the two variants are logically equivalent. 

Thus, for efficient differential computation, query optimization must be applied to the up- 
date expressions to choose the cheapest variant, as proposed in [ Vis98|| . 



Furthermore, note that if we wish to compute the differential when tuples are inserted into 
C, the plan (8^ X B) XI A or (d£ XI A) X B may be preferable to (A X B) X 8^. Thus, using 
a single expression, such as (A X B) X C to propagate differentials to A, B and C is likely to 
perform badly for at least one of the differentials. 

For this reason, we propagate differentials of only one relation at a time, and choose a 
separate plan for each differential propagation. 

We use a query optimizer for choosing best plans for such propagation, and our optimizer 
uses a DAG representation that compactly represents all expressions equivalent to a given 
expression. Since all alternative expressions above are available, the best one can be chosen for 
the propagation of each differential. We present details in Sections and ||. 
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Note also that recomputation of a materialized view is always an alternative to computing 
the differential in its result and updating it. Thus, the query optimizer must choose recompu- 
tation over incremental view maintenance, if recomputation is cheaper. 

3.3 The Role of Multi-Query Optimization in View Update 

Multi-query optimization attempts at exploiting common sub-expressions within a query, or 
across queries in a batch of queries submitted together, to reduce the query evaluation cost. 
In the context of view update, sharing can occur across the tasks of computing differentials of 
different views, or even within the task of computing the differential of a single view, as we 
show below. 

It is easy to see that related queries may share subexpressions, and if so, it may be best 
to compute the shared subexpression once, materialize it, and reuse it. However, this decision 



must be done in a cost based manner, as the following example from [ RSSB00 | illustrates. 



Example 3.1 Let Q\ and Q2 be two queries whose locally optimal plans (i.e., individual best 
plans) are (R XI S) N P and (R N T) M S respectively. The best plans for Q\ and Q 2 do not 
have any common sub-expressions. However, if we choose the alternative plan (R N S) N T 
(which may not be locally optimal) for Q2, then, it is clear that R X S is a common sub- 
expression and can be computed once and used in both queries. This alternative with sharing 
of R N S may be the globally optimal choice. 

On the other hand, blindly using a common sub-expression may not always lead to a globally 
optimal strategy. For example, there may be cases where the cost of joining the expression 
R IX! S with T is very large compared to the cost of the plan (R XI T) M S; in such cases it may 
make no sense to reuse R M S even if it were available. □ 

In the context of view maintenance, if two materialized views have common subexpres- 
sions, as in the example above, the expressions for computing the differential of the common 
subexpression would also be shared. 

To illustrate subexpression sharing possibilities within a single view maintenance query, 
consider a view V defined as in the example below. 

Example 3.2 Let view V = A\A B \A C M D. Suppose there are inserts on all four relations. 
The differential of V can then be computed using 

(i+MBMC^D) U ((A U 8%) N 5+ IX C X D) U 

((AUSX) * (BU5%) M <J+ X D) U {{A U 8\) M (B U 4) M iP U 5%) M <5+) 
The above expression represents algebraically the effect of propagating differentials one at a 



time, as described in Section 3.2.2 



Note that there are several potential common subexpressions in the above expression. For 
instance, if the plans chosen for the first two terms of the above union are (<5^ N [B N (C N D))) 
and ({AU 5\) N {5^ IX (C M D))), then C IX D is a common subexpression of the two. If 
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(Commutativity not shown - every join node has 
another join node with inputs exchanged, below 
the same equivalence node) 

(a) Initial Query (b) DAG representation of query (c) Expanded DAG after transformations 

Figure 1: Initial Query and DAG Representations 

the above plans are chosen, the subexpression can be computed once and shared. Similarly, 
(A U 5~a) CXI (B U 5g) is potentially a common subexpression. 

The alternative plans of (((^ M B) N C) N D) and ((((A U 5\) M 8%) X C) M D) offer no 
sharing possibilities, but may still be cheaper. □ 

Which the above plans should be used, and whether the common subexpressions should 
materialized and shared is a decision for the multiquery optimizer to make in a cost-based 
manner. Thus, it is the job of the multiquery optimizer to find the best overall plan taking 
sharing possibilities into account. 

4 DAG Representation of Queries 

Our algorithms use an extended form of the DAG representation of queries used, for instance, 



in Volcano [GM91]. In this section we summarize the DAG representation and terminology 



from HRSSB00 |. 



An AND-OR DAG is a directed acyclic graph whose nodes can be divided into AND-nodes 
and OR-nodes; the AND-nodes have only OR-nodes as children and OR-nodes have only AND- 
nodes as children. 

An AND-node in the AND-OR DAG corresponds to an algebraic operation, such as the join 
operation (N) or a select operation (er). It represents the expression defined by the operation 
and its inputs. Hereafter, we refer to the AND-nodes as operation nodes. An OR-node in the 
AND-OR DAG represents a set of logical expressions that generate the same result set; the set 
of such expressions is defined by the children AND nodes of the OR node, and their inputs. We 
shall refer to the OR-nodes as equivalence nodes henceforth. 

4.1 Representing a Single Query 

A given query is initially represented directly in the AND-OR DAG formulation. For example, 
the query tree of Figure |l](a) is initially represented in the AND-OR DAG formulation, as 
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shown in Figure [T|(b). Equivalence nodes (OR-nodes) are shown as boxes, while operation 
nodes (AND-nodes) are shown as circles. 

The initial AND-OR DAG is then expanded by applying all possible transformations on 
every node of the initial query DAG representing the given set of queries. Suppose the only 
transformations possible are join associativity and commutativity. Then the plans A N (B N C) 
and (A N C) N B, as well as several plans equivalent to these modulo commutativity can be 
obtained by transformations on the initial AND-OR- DAG of Figure |l](b). These are represented 
in the DAG shown in Figure [l](c) . We shall refer to the DAG after all transformations have 
been applied as the expanded DAG. Note that the expanded DAG has exactly one equivalence 
node for every subset of {^4, B, C}; the node represents all ways of computing the joins of the 
relations in that subset. 



4.2 Representing Sets of Queries in a DAG 

Queries are inserted into the DAG structure one at a time. When a query is inserted, equivalence 
nodes and operation nodes are created for each of the operations in its initial query tree. Some 
of the subexpressions of the initial query tree may be equivalent to expressions already in the 
DAG. Further, subexpressions of a query may be equivalent to each other, even if syntactically 
different. For example, query may contain a subexpression that is logically equivalent to, 
but syntactically different from another subexpression of the query (e.g., (A XI B) CXI C, and 

Before the second subexpression is expanded, the DAG would contain two different equiv- 
alence nodes representing the two subexpressions. [ RSSBOtifl modifies the Volcano DAG gen- 



eration algorithm such that whenever it finds nodes to be equivalent (in the above example, 
after applying join associativity) it unifies the nodes, replacing them by a single equivalence 
node. The Volcano optimizer [ |GM91 ] already has a hashing-based scheme to efficiently detect 



repeated expressions, thereby avoiding creation of new nodes that are equivalent to existing 
nodes. The extension of [ RSSBOtifl additionally unifies existing logically equivalent nodes. An- 



other extension is to detect and handle subsumption derivations. For example, cta<5(E) can 
be computed from cta<io{E) if they both appear in a set of queries. Similarly, if we have 
aggregations dnoG SU m(Sai) ( E ) and ag eGsum(Sai){E), we can introduce a new equivalence node 
dno,ageG S um(Sai){E) and add derivations of the other two from this one. For more details of 



unification and subsumption derivations involving selections and aggregation, see [RSSBOO]. 



4.3 Physical Properties 

It is straightforward to refine the above AND-OR DAG representation to represent physical 
properties | |GM9f ], such as sort order, that do not form part of the logical data model, and 



obtain a physical AND-OR DAG []. The presence of an index on a result is also modeled as a 



physical property of the result by [RSSBOC], making the code that handles physical properties 

3 For example, an equivalence node is refined to multiple physical equivalence nodes, one per required physical 
property, in the physical AND-OR DAG. 



10 



also perform index selection. Physical properties of intermediate results are important; for 
example, if an intermediate result is sorted on a join attribute, the join cost can potentially be 
reduced by using a merge join. This also holds true of intermediate results that are materialized 
and shared. Our implementation indeed handles physical properties, including sort orders and 
indices, but to keep the description simple we do not explicitly consider physical properties. 

5 Finding Optimal Plans 



We first outline how to find optimal plans for queries, following [RSSB00|, and then outline 
extensions to find optimal plans for view maintenance. In both cases, we assume that the set 
of views chosen for materialization is fixed. In Section [6| we outline how to integrate the choice 
of views to materialize with the choice of optimal plans for view maintenance. 

5.1 Finding Optimal Plans for Queries 

The Volcano optimization algorithm finds the best plan for each node of the expanded DAG by 
performing a depth first traversal of the DAG. Costs are defined for operation and equivalence 
nodes. The computation cost of an operation node is o is defined as follows: 

compcost(o) = cost of executing (o) + H e . echi i dren r \compcost(ei) 
The children of o (if any) are equivalence nodes. The computation cost of an equivalence node 
e is given as 

compcost(e) = min{compcost(oi)\oi E children(e)} 
and is if the node has no children (i.e., it represents a relation) Note that the cost of 
executing an operation o also takes into account the cost of reading the inputs, if they are not 
pipelined. 

Volcano also caches the best plan it finds for each equivalence node, in case the node is 
re- visited during the depth first search of the DAG. 

A simple extension of the Volcano algorithm to find best plans given a set of materialized 



views is described in [RSSBOC]. We outline this extension below. 

Let reusecost(m) denote the cost of reusing the materialized result of m, and let M denote 
the set of materialized nodes. 

To find the cost of a node given a set of nodes M have been materialized, we simply use 
the Volcano cost formulae above for the query, with the following change. When computing 
the cost of an operation node o, if an input equivalence node e is materialized (i.e., in M), the 
minimum of reusecost(e) and compcost(e) is used for computing compcostio). Thus, we use 
the following expression instead: 

compcost(o, M) = cost of executing (o) + S e . 6fA j Wren ( o) C(ej, M) 
where C(ej,M) = compcost(ei) if ej ^ M 

= min(compcost(ei, M),reusecost(ei)) if ej E M. 



4 Relation scans are explicitly represented as an operation and assigned a cost. 
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and compcost for equivalence nodes is defined as before. Thus, the extended optimizer computes 
best plans for the query in the presence of materialized results. The extra optimization overhead 
is quite small. 

5.2 Extending the DAG Structure for Computing Differentials 

We now outline how to extend the DAG structure to represent the differentials of a set of ex- 
pressions. We first construct the expanded DAG for the given expression (or set of expressions). 
As in Volcano, each equivalence node in the DAG has a set of logical properties such as the 
schema of the expression result, and estimated statistics about the result such as number of 
tuples. Once the best plan is computed for a node, it is cached in case it is needed later during 
optimization. 

If there are n relations, R\, . . . , R n , we need to store information about the differentials of 
the node with respect to 6^ , 5^ , 5~^ 2 , 5~^ 2 , and so on until <5^ , 5~^ n . We number these updates 
as 1, . . . , 2n, and use these numbers to identify the update. 

To optimize differential plans, each equivalence node stores information for the differentials 
of the expression with respect to each update type, in addition to information about the full 
result. Each equivalence node e therefore stores an array of 2n records, as below. Each odd 
numbered entry 2i — l,i = l..n, of this array contains: 

1. logical properties (such as schema and estimated statistics) of the differential of e with 
respect to inserts on Ri 

2. the best plan for computing the differential of e with respect to inserts on R{ 

3. the logical properties of the full result of the equivalence node after inserts and deletes to 
relations Ri, . . . , R-i-\ have been propagated 

Similarly, each even numbered entry 2i,i = l..n, of this array contains similar information on 
differentials and best plans with respect to the deletes on Ri, and the logical properties for the 
full result of the equivalence node after inserts and deletes to relations R±, . . . , Ri-i, and inserts 
to Ri have been propagated. 

In addition, as in the original representation, each node stores the best plan for (and cost 
of) recomputing the entire result of the node after all updates have been made on the base 
relations. 

The logical properties of the differentials are computed by a bottom-up traversal of the DAG. 
We describe later how the best plans for computing differentials are computed and stored. If 
an equivalence node does not depend on relation Ri, we flag this during the above bottom-up 
traversal, and set the plans in entry 2i and 2i + 1 to be null. 

The traversal also computes and stores an estimate of the execution cost of the differential 
version of each operation in the DAG (such as a join or an aggregation). The properties of the 
differentials of its inputs, as well as the full version of the input, where required, are used to 
compute this estimate. 
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5.3 Finding Optimal Plans for Updates 

We now outline how to find optimal plans for updates, using the above mentioned DAG rep- 



resentation. Recall the example from Section 3.2.3, with the expression being A N (B N C) 



and an insert on A, the plan [8\ N B) N C, is likely to be more efficient than <Tt N (£? N C), 
and should be considered. Luckily for us, the DAG representation of the query represents 
(A N B) M C in addition to i M (B M C) (see Figure |). 



We now extend the technique for finding optimal plans for queries described in Section 5.1 
to find the optimal way of propagating the differential 5\ . 

Some equivalence nodes do not depend on some relations, and their differential with respect 
to the relation will be empty. Let diffChildren(o, i) denote all equivalence node children of o 
whose differential is non-empty on update i, and fullChildren(o, i) denotes all children of o whose 
full results are required to compute the differential of o, in conjunction with diffChildren(o,i). 

For instance, diff Children for an operation that joins A with (B N C), with respect to an 
insert on B, is the node BMC, and correspondingly fullChildren of the node is A. 

Given an operation node o in the DAG, let 5(o, i) denote the differential of operation o with 
respect to update i. Also let localDiffCost(o, i) denote the cost of executing the operations in 
S(o, i), without counting the cost of generating its inputs. 

Similarly, for an equivalence node e, let 5(e, i) denote the differential result of e with respect 
to update i. Then, the total cost of generating the differential result of an operation node o 
with respect to update i, diffCost(o,i) can be computed by: 

localDiffCost{o,i) + ^ eje diffChildren(o,i) di ff Cost ( e j^) + ^efullChildren^if 0171 ? 008 ^) 

The cost of computing the differential of an equivalence node e with respect to update i is given 
as 

diffCost(e,i) = min{diffCost(oj,i)\oj £ children(e)} 
and is if the node has no children (i.e., it represents a relation or a relation differential). The 



definition of compcost is as defined earlier in Section 5.1, and represents the cost of recompu- 
tation of the node after updates have been performed on the database relations. 

The above formula is extended for the case where some nodes are materialized, as follows. 
Note that the full result of a node may be materialized, and independently, any of its differential 
results may also be (temporarily) materialized. Let the set of materialized results be M; For 
an operation node o, we compute diffCost(o, M, i) as: 

localDiffCost(o, i) + ^ eje diffChildren(o,i) C ( e J^ M > + ^ ej efullChUdren{o,i) C ^v M ) 

where C is defined as follows: if 5(e,i) is not materialized (i.e., not in M), 

C(e,M,i) = min{diffCost(oj,M,i)\oj £ children(e)} 
and if 5(e,i) is materialized (i.e., in M), 

C(e,M,i) = min(reusecost(e,i),min{diffCost(oj, M,i)\oj £ children(e)}) 
and reusecost(e, i) denotes the cost of reusing the materialized result of <5(e, i). Also, C(ej, M) 
plays the same role for the full result of node ej , as defined in Section |5.1| . 
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For an equivalence node e, diffCost(e, M,i) = min{diffCost(oj , M, i)\oj € children(e)} 
and is if the node has no children (i.e., it represents a relation or a relation differential). 
That is, diffCost represents the cost of computing the differential, even if the differential is 
materialized. 

For each equivalence node e, the operation node corresponding to the minimum cost in the 
above formula defines the (top node of the) best plan for 5(e,i), given that results in M are 
materialized. 

Further, we can compute the total cost of computing the differential of a node as 
totalDiffCost(e, M) = Sj=i...2n diffCost(e,M). 

We perform a single traversal of the DAG to compute the costs for each equivalence/operation 
node, based on the above equations. During the traversal we also cache the best (minimum 
cost) plan computed for each differential, just as we cache the best plans for each full result. 

Note that if both inputs to a join E\ M E2 are expressions using a common relation R, an 
update to R results in changes to both inputs, and as a result, the update expression for the 
join is (S~e X E2) U {{E\ U <5^ ) N 6g ). In this case, a join in the original expression has been 
converted into a union of two joins. The best plan for each join is found, giving the best plan 
for the entire expression, and this combined best plan must be stored (and used to compute 
the cost of finding the differential) . 

Optimizations that exploit knowledge of foreign key dependencies can be used to detect 
that certain join results involving differentials will be empty JQGMW96 , Vis98|. For instance, 



if r.B is a foreign key into s.A, then the join of 5f and r will be empty. Based on this, parts of 
the differential expression can be detected to be empty, and eliminated during optimization. 

6 The Greedy Algorithm for Selecting Materialized Views 

Till now we assumed that the set of materialized nodes is fixed. We now describe how to 
integrate the choice of extra materialized views/indices with the choice of best plans for view 
maintenance. Our algorithm is based on a greedy heuristic. We first present the basic algorithm, 
then some optimizations, and extensions, below. 

6.1 The Basic Greedy Algorithm 

As outlined earlier, we first take the given set of materialized views V, and build a DAG 
structure on the expressions defining the views. The nodes of the DAG corresponding to views 
in V are marked as already chosen for materialization. 

We consider both full and differential results for materialization. A result is identified by 
a node and an update number (in our implementation a full result is identified by the update 
number 0, and differential results by numbers 1 . . . In). 

If a result is chosen for temporary materialization, we must take into account the cost of 
computing it. And if it is chosen for permanent materialization, we must take into account the 
cost of maintaining it (we need not consider the cost of initial materialization since it is a one 
time cost). 
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Procedure Greedy 

Input: Expanded DAG for V, the initial set of materialized views, 

and the set of candidate equivalence nodes (and their differentials) for materialization 
Output: Set of nodes/differentials to materialized 

X = V 

Y = set of candidates equivalence nodes/differentials for materialization 
while (Y ^ </>) 

Pick the node x with the highest benefit(x,X) 

if {benefit(x,X) < 0) 

break; /* No further benefits to be had, stop */ 

Y = Y - x; X = X U {x} 
return X 

Figure 2: The Greedy Algorithm 

The cost of maintaining a node incrementally is the sum of the costs of its differentials: 
maintcost(n, M) = totalDiffCost(n, M) + mergeCost(n) 
where mergeCost(n) denotes the cost of updating the materialized result of n using the differ- 
entials. 

For a full result n, we define 

cost(n, M) = min(compcost(n, M) + matcost(n),maintcost(n, M)) 
where matcost{n) denotes the cost of writing out the computed result of n. That is, when 
finding the cost of the full result of a materialized node, we take the minimum of the cost via 
recomputation and the cost via computing the differentials. 

For a differential result x = 6(n,i), we define 

cost(x,M) = dijJCost(n, M,i) + matcost(x) . 
Given a set S of results (full/differential), let cost(S,M) be defined as 

cost(S,M) = T, qe scost(q, M) 
Given a set of results M already chosen to be materialized, and a result x, the benefit of 
additionally materializing x, benefit(x, M), is defined as: 

benefit(x, M) = cost(M, M) - (cost(M, {x} U M) + cost(x, M)) 
Note that (cost(M, {x} U M) + cost(x, M)) is equivalent to cost(M U {x}, M U {x}). 

Figure Q outlines a greedy algorithm that iteratively picks nodes to be materialized. The 
procedure takes as input the set of candidate results (equivalence nodes, and their differentials) 
for materialization. At each iteration, the node x that gives the maximum reduction in the cost 
if it is materialized, is chosen to be added to X. 

The procedure not only selects results for maintenance, but also decides on how they should 
be maintained. Specifically, for full results, it chooses the cheaper of recomputation (including 
the cost of storing the result), and differential computation (including the cost of performing 
the computed differential updates). If recomputation is cheaper for a result, and the result was 
not part of the given set of materialized view, the result can be materialized temporarily during 
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view maintenance, and discarded later. Differential results that are chosen to be materialized 
are materialized only temporarily since they are only used during view maintenance. 



6.2 Optimizations 

The greedy algorithm as described above can be expensive due to the large number of times 
the function benefit is called, (which in turn calls the expensive function costQ). 

Some important optimizations to the greedy algorithm for multi-query optimization are 
presented in iRSSBOOfl . We use two of the optimizations, with some extensions for handling 



differentials: 

1. There are many calls to benefit (and thereby to costQ) at line LI of Figure ^, with 
different parameters. A simple option is to process each call to cost independent of other 
calls. However, observe that the set of materialized nodes which is the second argument 
of cost changes minimally in successive calls — successive calls take parameters of the 
form cost(R, {x} U X), where only x varies. That is, instead of considering x\ U X for 
materialization, we are now considering storing X2^JX for materialization. The best plans 
computed earlier does not change for nodes that are not ancestors of either x\ or X2- It 
makes sense for a call to leverage the work done by a previous call by recomputing best 
plans only for ancestors of x\ and X2- 

A novel incremental cost update algorithm is presented in [ RSSB0C| l, This algorithm 



maintains the state of the DAG (which includes previously computed best plans for the 
equivalence nodes) across calls to cost, and may even avoid visiting many of the ancestors 
of x\ (which is unmaterialized) and X2 (which is materialized). 

In our context of finding update plans, we have to modify the incremental cost update 
algorithm slightly. 

(a) If the full result of a node is materialized, we update not only the cost of computing 
the full result of each ancestor node, but also the costs for the 2n differentials of 
each ancestor node since the full result may be used in any of the 2n differentials. 
Propagation up from an ancestor node can be stopped if there is no change in cost 
to computing the full result or any of the differentials. 

(b) If the differential of a node with respect to update i is materialized, we update only 
the differentials of its ancestors with respect to update i. Propagation can stop on 
ancestors whose differentials with respect to i do not change in cost. 

2. The monotonicity optimization works as follows. With the greedy algorithm as presented 
above, in each iteration the benefit of every candidate node that is not yet materialized 
is recomputed since it may have changed. 

The monotonicity optimization is based on the assumption that the benefit of a node 
cannot increase as other nodes are chosen to be materialized - while this is not always true, 
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it is often true in practice. The monotonicity optimization makes the above assumption, 
and does not recompute the benefit of a node x if the new benefit of some node y is 
higher than the previously computed benefit of x. It is clearly preferable to materialize 
y at this stage, rather than x — assuming monotonicity holds, the benefit of x could not 
have increased since it was last computed, and it cannot be the node with highest benefit 
now, hence its benefit need not be recomputed now. 

Thus, recomputations of benefit are greatly reduced. 



[ RSSBOO l presents a third optimization based on potential sharability of nodes between 
queries. This optimization is not relevant here, since even a node that is not sharable may be 
worth materializing permanently. 

For the purpose of this paper, the details of the above optimizations are not critical, but 



the interested reader may refer to [RSSBOC] for details. 

The Greedy procedure can be extended in a straightforward manner to consider a workload 
of queries, along with periodic updates, and to choose the best set of results (and indices) 
to materialize, to minimize the cost of the queries and view update. Some optimizations are 



needed to handle large workloads [ RSSBOO l. We can also introduce limits on space for storing 
permanently materialized results and temporarily materialized results. Results can then be 
materialized in the order of benefit per unit space, instead of just benefit. 

7 Performance Study 

We implemented the algorithm described earlier for finding optimal plans for view maintenance. 
Like the existing multiquery optimization code, the new code implements index selection along 
with selection of results to materialize. Our current implementation has a restriction in that it 
only considers full results for materialization, although a version which also considers differential 
results for materialization should be ready shortly. Thus our estimated benefits are actually 
conservative, and we may be able to get even better results once the full implementation is 
ready. However, the benefits are already very significant. 

7.1 Performance Model 

We used a benchmark consisting of TPC-D queries (and some variants based on the same 
TPC-D schema). The performance measure is estimated execution cost, called plan cost in the 
performance graphs. Our cost model extends the cost model used in the multiquery optimizer, 
by taking differential computation into account. The cost model used takes into account number 
of seeks, amount of data read, amount of data written, and CPU time for in-memory processing. 
Since we do not currently have a query execution engine which we can extend to perform 
differential view maintenance, we are unable get actual numbers. However, the cost model is 
fairly sophisticated, and further, benefits for multiquery optimization predicted by the basic 
cost model have been verified by running rewritten queries on commercial database systems 
( |RSSB00 ], and results in a companion paper on query result caching), giving support to the 



accuracy of estimated benefits. 



17 








10 20 30 40 50 60 70 80 
Update Percentage 

(a) without aggregation 







10 20 30 40 50 60 70 80 
Update Percentage 

(b) with aggregation 



Figure 3: Maintaining Stand-alone Views 

We provide performance numbers for different percentages of updates to the database rela- 
tions; we assume that all relations are updated by the same percentage. To model a growing 
database, we have twice as many inserts as deletes. In our notation, a 10 percent update to a 
relation consists of inserting 10% as many tuples are currently in the relation, and deleting 5% 
of the current tuples. 

We compare the performance of our greedy algorithm (referred to as Greedy in our discussion 
and figures) with plain Volcano query optimization extended to choose between recomputation 
and incremental maintenance of views (referred to as NoGreedy). (The algorithm of Vis98|| 
falls in the same class as NoGreedy, although the optimization method is somewhat different.) 
For each query, we present results at different update percentages, ranging from 1% to 80%. 

The cost of view maintenance is affected by the presence of indices. Normally, databases 
have indices on the primary key attributes of each relation, to check for uniqueness. Hence we 
assume that for each of the TPC-D relations, an index is present on the primary key attributes. 
However, we also ran our benchmark assuming that no indices are initially present, and found 
that all required indices got chosen for permanent materialization by our algorithm. Thus, the 
cost of the plans we generate were not significantly affected by the presence of indices, although 
the cost of plans without our optimizations rose if indices were not already present. 

We assume a TPC-D database at scale factor of 0.1, that is the relations occupy a total of 
100 MB. The buffer size is set at 8000 blocks, each of size 4KB, although we also ran some tests 
at a much smaller buffer size of 1000 blocks. The tests were run on an Ultrasparc 10, with 256 
MB of memory. 
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Figure 4: Maintaining a Set of Views of Same Class 



7.2 Performance Results 

Maintaining Individual Views. Figure ^ shows our results on two queries, the first consisting 
of the join of 4 relations, without aggregation, and the second consisting of aggregation on the 
same join. As can be seen from the figures significant benefits are to be had, especially at low 
update percentages, but there are benefits even at relatively high update percentages. 
Maintaining a Set of Views. Figure |] shows our results on two sets of queries, the first 
containing five queries without aggregation and the second containing five queries with aggre- 
gation. The benefit ratio due to Greedy is again excellent at lower update percentages. There 
is a jump in cost at one point, which is because of the use of an algorithm that depends on an 
input fitting in memory, and when the input does not fit in memory its cost increases sharply. 

Figure ^ shows the results on a set of 10 queries, with indices already present on all primary 
key attributes, and without any indices present initially; all required indices got chosen for 
materialization. 

Cost of Optimization. For a set of 10 materialized views, each a join of 3 to 4 TPC-D 
relations, (whose results are shown in Figure |5|), the time for Greedy optimization was 31 
seconds. Note however that 31 seconds is low compared to the savings of up to 1000 seconds 
obtained for one run of view maintenance, and besides it is a one-time cost whereas view 
maintenance is typically at least a daily task in a data warehouse. Thus, the extra cost for our 
algorithms is worthwhile. 

The number of candidates for materialization grows exponentially with the number of re- 
lations in a query. We are currently working on techniques to prune the set of candidates, in 
order to keep optimization time in tight control even with a higher number of relations. 
Temporary vs. Permanent Materialization. Out of a total of 1600 results that were 
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Figure 5: Maintaining a Large Set of Views 



materialized (totalling across many different queries and query sets that we considered, and 
across update percentages ranging from 1 percent to 90 percent), we found that for about 1000 
the recomputation cost was less, meaning they are materialized temporarily, and for 600 the 
maintenance cost was less, meaning they were materialized permanently. At 1 to 5 % update 
rates, the ratio was 281 to 306, while at 50 to 90 % update rates, the ratio changed to 360 to 
88, in favor of recomputation. 

Effect of Buffer Size. With a buffer size of 1000 blocks (instead of 8000 blocks), we found 
that the costs of plans with and without Greedy optimization went up, but the increase was 
more for recomputation plans and the benefit ratio for small update percentages was actually 
more strongly in favor of our algorithms. 

8 Conclusions and Future Work 

The problem of finding the best way to maintain a given set of materialized views is an important 
practical problem, especially in data warehouses/data marts, where the maintenance windows 
are shrinking. We have presented solutions that exploit commonality between different tasks in 
view maintenance, to minimize the cost of maintenance. Our techniques are easy to implement 
on an existing multiquery optimizer. As shown by the results in section [?], our techniques can 
generate significant speedup in view maintenance cost, and the increase in cost of optimization 
is acceptable. We therefore believe that our results are a timely solution to an important 
practical problem. 

Future work includes further heuristics to decrease optimization cost, and implementing 
extensions to efficiently handle workloads containing queries. We also plan to port the system 
to a dynamic query result caching environment; in a companion paper, we study the issue of 
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selecting results to cache dynamically, in the absence of updates. 
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